This article details the high-level steps that can be taken
to recover from particular types of disaster scenarios. Because this book
and chapter focus on Windows Server 2008 R2 environments, so do
the following sections.
Network Outage
When an organization is faced
with a network outage, the impact can affect a small set of users, an
entire office, or the entire company. When a network outage occurs, the
network administrators should perform the following tasks:
Test the reported
outage to verify if the issue is related to a wide area network (WAN)
connection between the organization and the Internet service provider
(ISP), the router, a network switch, a firewall, a physical fiber or
copper network connection or network port, or line power to any of the
aforementioned devices.
After
the issue is isolated or, at least, the scope of the issue is
understood, the network administrator should communicate the outage to
the necessary managers and/or business owners and, as necessary, open
communication to outside support vendors and ISP contacts to report the
issue and create a trouble ticket. This notification should not be sent by
email if the network is down.
Create a logical action plan to resolve the issue and execute the plan.
Create
and distribute a summary of the cause and result of the issue and how
it can be avoided in the future. Close the trouble ticket as required.
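The isolation task at the start of this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hop names and their order are hypothetical, and the probe results are assumed to have been gathered already (for example, by pinging each device in turn). The function simply reports the nearest device on the path that failed its probe, which is usually the best place to start investigating.

```python
from typing import Dict, Optional, Sequence

# Hypothetical path from a user's workstation out to the Internet,
# ordered from nearest to farthest. Replace with the real topology.
PATH = ("switch", "firewall", "router", "wan_link", "isp_gateway")


def first_failed_hop(reachable: Dict[str, bool],
                     path: Sequence[str] = PATH) -> Optional[str]:
    """Return the nearest hop that failed its probe, or None if all passed.

    `reachable` maps a hop name to True (probe succeeded) or False.
    A hop missing from the mapping is treated as untested and is
    therefore reported as a failure.
    """
    for hop in path:
        if not reachable.get(hop, False):
            return hop
    return None


if __name__ == "__main__":
    probes = {"switch": True, "firewall": True, "router": False,
              "wan_link": False, "isp_gateway": False}
    print(first_failed_hop(probes))  # prints: router
```

Everything beyond the first failed hop is expected to be unreachable as well, so only the nearest failure is reported.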
Physical Site Failure
In the event a physical site
or office cannot be accessed, a number of business operations might be
suspended. Planning how to mitigate issues related to physical site
limitations can be extensive, but should include the considerations
discussed in the following sections.
Physical Site Access Is Limited but Site Is Functional
This section lists a few
considerations for a situation where the site or office cannot be
accessed physically, but all systems are functional:
Can the main and most critical phone lines be accessed or forwarded remotely?
Is
there a remote access solution to allow employees with or without
notebooks/laptop computers to connect to the organization’s network and
perform their work?
Are
there any other business operations that require onsite access that are
tied to a service-level agreement, such as responding to paper faxes or
submitted customer support emails, phone calls, or custom applications?
Physical Site Is Offline and Inaccessible
This section lists a few
considerations for a situation where the resources in a site are
nonfunctional. This scenario assumes that the site resources cannot be
accessed across the network or Internet and the data center is offline
with no chance of a quick recovery. When planning for a scenario such
as this, the following items should be considered:
Can all services
be restored in an alternate capacity, or at least the most critical
ones, such as the main phone lines, fax lines, devices,
applications, systems, and remote access services?
If
systems are cut over to an alternate location, what is the impact on
performance, and what percentage of end-user load can the system support?
If systems are cut over to an alternate location, will there be any data loss or will only some data be accessible?
If
the decision to cut over to the alternate location is made, how long
will it take to cut over and restore the critical services?
If
the site outage is caused by power loss or network issues, how long of
an outage should be sustained before deciding to cut over services to
an alternate location?
When
the original system is restored, if possible, what will it take to
fail back or cut the systems back to the main location, and will any
data loss or data synchronization be involved?
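The question of how long an outage should be sustained before cutting over can be written down as a simple rule. The sketch below is one possible rule, not a recommendation from this chapter: the function name, its parameters, and the rule itself are assumptions made for illustration, and a real plan would also weigh data loss and the effort to fail back.

```python
def should_cut_over(expected_outage_hours: float,
                    cutover_hours: float,
                    tolerated_outage_hours: float) -> bool:
    """Decide whether to move services to the alternate location.

    Assumed rule: stay at the main site while the expected outage is
    within what the business tolerates; otherwise cut over only if the
    cutover restores service sooner than waiting would.
    """
    if expected_outage_hours <= tolerated_outage_hours:
        return False
    return cutover_hours < expected_outage_hours


if __name__ == "__main__":
    # An 8-hour outage, a 2-hour cutover, and a 4-hour tolerance.
    print(should_cut_over(8, 2, 4))  # prints: True
```

Agreeing on the tolerated outage figure in advance, per service, keeps this decision from being made under pressure during the outage itself.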
These short lists merely
scratch the surface when it comes to planning for or dealing with a
physical site outage, but, hopefully, they will spark some dialogue in
the disaster recovery planning process to lead the organization to the
solution that meets its needs and budget.
Server or System Failure
When a server or system
failure occurs, administrators must decide which recovery plan of
action will be the most effective. Depending on the particular system,
in some cases, it might be more efficient to build a new system and
restore the functionality or data. In other cases, where rebuilding a
system can take several hours, it might be more prudent to troubleshoot
and repair the problem.
Application or Service Failure
If a Windows Server 2008 R2
system is still operational but a particular application or service on
the system is nonfunctional, in most cases troubleshooting and
attempting repair, or restoring the system to a previous backup state, is
the correct plan of action. The Windows Server 2008 R2 event log is
a much more useful tool than in previous versions, and it should be
one of the first places an administrator looks to determine the cause
of a validated issue. Following the troubleshooting or recovery procedures
for the particular application is the next logical step. For example,
if an end user deleted a folder from a network
share, the preferred recovery method might be to use Shadow Copy
backups to restore the data instead of Windows Server Backup.
For Windows services,
using Server Manager to review the status of the role and role services
assists administrators in identifying and isolating problems because
the Server Manager tool displays a filtered representation of Event
Viewer items and service state for each role installed on the system. Figure 1 shows that the File Services role on SERVER10 logged several errors and warnings in the last 24 hours.
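The same kind of review can be done from a script. The following is a minimal sketch, assuming Python is installed on the server, that builds a query for the built-in wevtutil command-line tool to list critical, error, and warning events from the last 24 hours. The function names and default values are illustrative, and the script does not reproduce the per-role filtering that Server Manager performs.

```python
import subprocess
from typing import List


def build_event_query(log_name: str = "System", hours: int = 24,
                      max_events: int = 20) -> List[str]:
    """Build a wevtutil command for recent critical/error/warning events."""
    millis = hours * 60 * 60 * 1000
    # Event levels: 1 = Critical, 2 = Error, 3 = Warning.
    xpath = ("*[System[(Level=1 or Level=2 or Level=3) and "
             f"TimeCreated[timediff(@SystemTime) <= {millis}]]]")
    return ["wevtutil", "qe", log_name, f"/q:{xpath}",
            f"/c:{max_events}", "/rd:true", "/f:text"]


def recent_problems(log_name: str = "System") -> str:
    """Run the query and return the text output (Windows only)."""
    result = subprocess.run(build_event_query(log_name),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(recent_problems("System"))
```

Passing the command as a list avoids shell quoting problems with the XPath expression. The `/rd:true` switch returns the newest events first, so `/c:` limits the output to the most recent entries.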
Data Corruption or Loss
When a report has been logged
that the data on a server is missing, is corrupted, or has been
overwritten, Windows Server 2008 R2 administrators have a few options
to deal with this situation. Shadow Copies for Shared Folders can be
used to restore previous versions of selected files or folders and
Windows Server Backup can be used to restore selected files, folders,
or the entire volume on a Windows disk. Using Shadow Copies for Shared
Folders, administrators and end users with the correct permissions can
restore data right from their workstation. Using the restore features
of Windows Server Backup, administrators can place the restored data
back into the same folder, either overwriting the existing data or creating a
copy of the data with a different name based on the backup schedule
date and time. For example, when a file called ClientProposal.docx that was backed up on October 9, 2009, at 12:30 p.m. is restored as a copy, Windows Server Backup names the file 2009-10-09 12-30 Copy of ClientProposal.docx, and the time is expressed in the current time zone of the server.
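When scripting a search for restored copies, it can help to predict the name a copy will carry. The sketch below reproduces the naming pattern from the example above. The pattern is inferred from that single example, so details such as the use of a 24-hour clock are assumptions, and the function name is hypothetical.

```python
from datetime import datetime


def restored_copy_name(original_name: str, backup_time: datetime) -> str:
    """Return the expected file name of a restored copy.

    Pattern assumed from the example in the text:
    "<YYYY-MM-DD> <HH-MM> Copy of <original name>".
    `backup_time` should be expressed in the server's time zone.
    """
    return f"{backup_time:%Y-%m-%d %H-%M} Copy of {original_name}"


if __name__ == "__main__":
    name = restored_copy_name("ClientProposal.docx",
                              datetime(2009, 10, 9, 12, 30))
    print(name)  # prints: 2009-10-09 12-30 Copy of ClientProposal.docx
```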
Hardware Failure
When hardware failure
occurs, a number of issues and symptoms might result. The most common
issues related to hardware failures include system crashes, services or
drivers stopping unexpectedly, frozen (hung) systems, and systems that
are in a constant reboot cycle. When hardware is suspected as failed or
failing on a Windows Server 2008 R2 system, administrators should first
review the event logs for any related system or application event
warnings and errors. If nothing apparent is logged, hardware
manufacturers usually provide several different diagnostic utilities
that can be used to test and verify hardware configuration and
functional state. Do not wait to call Microsoft and involve its
professional support services department, because they can work in
conjunction with your team to capture and review debugging data.
When a system is
suspected of having hardware issues and it is a business-critical
system, steps should be taken to migrate services or applications
hosted on that system to an alternate production system, or the system
should be recovered to new hardware. Windows Server 2008 R2 can
tolerate a full system restore or a complete PC restore to alternate
hardware if the system is an exact or close hardware match with regard
to the motherboard, processors, hard disk controller, and network card.
Even if the hardware is an exact match, a complete PC restore to alternate
hardware might fail when the disk arrays, disk IDs, and volume
or partition numbers do not match, unless additional steps are taken during the restore
or recovery process.
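Before attempting a restore to alternate hardware, the components named above can be compared between the failed system and the replacement. The sketch below assumes the hardware details have already been recorded as simple inventories. The dictionary keys and sample values are hypothetical, and an exact string comparison is stricter than the "close match" the restore process can tolerate, so a reported mismatch is a prompt for review rather than proof that the restore will fail.

```python
from typing import Dict, List

# Components the text identifies as mattering for a restore to
# alternate hardware. The key names are illustrative only.
CRITICAL_COMPONENTS = ("motherboard", "processor",
                       "disk_controller", "network_card")


def mismatched_components(source: Dict[str, str],
                          target: Dict[str, str]) -> List[str]:
    """List critical components that differ between two inventories.

    A component missing from either inventory counts as a mismatch,
    because it cannot be confirmed to match.
    """
    return [name for name in CRITICAL_COMPONENTS
            if name not in source or source[name] != target.get(name)]


if __name__ == "__main__":
    failed = {"motherboard": "Model A", "processor": "CPU X",
              "disk_controller": "RAID 1", "network_card": "NIC 1"}
    spare = {"motherboard": "Model A", "processor": "CPU X",
             "disk_controller": "RAID 2", "network_card": "NIC 1"}
    print(mismatched_components(failed, spare))  # prints: ['disk_controller']
```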